This notebook accompanies the QUANTBIO2019 workshop. It contains placeholders (...) to fill the code in exercises presented during the workshop.
You can knit this document into an HTML file. Once you fill in the code, change the chunk option eval=F to eval=T.
Read data from a single experiment from the file ./data/original/tCoursesSelected_EGFsust_exp20170316.csv.gz. Use the data.table::fread function.
# load required R packages
require(R.utils)
require(data.table)
# read the file
dtEGF = ...
# view first couple of lines of this dataset
head(dtEGF)
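A possible fill-in is sketched below. The workshop call itself is shown as a comment (it needs the data file on disk, and fread requires the R.utils package to decompress .gz); the runnable part of the sketch round-trips a small synthetic CSV instead:

```r
library(data.table)

# the workshop file would be read as:
# dtEGF = fread("./data/original/tCoursesSelected_EGFsust_exp20170316.csv.gz")

# self-contained demo: write a small CSV and read it back with fread
tmp = tempfile(fileext = ".csv")
fwrite(data.table(Metadata_T = 0:2, Ratio = c(1.1, 1.2, 1.3)), tmp)
dtDemo = fread(tmp)
```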
The experimental description is located in this Excel file ./data/original/experimentDescription_EGFsust.xlsx.
Use readxl::read_xlsx function.
require(readxl)
# the as.data.table function converts the tibble returned by read_xlsx function to a data table.
dtEGFexp = as.data.table(...)
head(dtEGFexp)
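One possible fill-in, assuming the Excel file above is in place (read_xlsx reads the first sheet by default):

```r
require(readxl)
require(data.table)

# read_xlsx returns a tibble; as.data.table converts it to a data.table
dtEGFexp = as.data.table(readxl::read_xlsx("./data/original/experimentDescription_EGFsust.xlsx"))
```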
Data from all experiments merged into a single file:
require(R.utils)
require(data.table)
dtAll = fread("./data/m1_allGF_wExpDescr.csv.gz")
head(dtAll)
## Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
## 1: EGF 0 0 3
## 2: EGF 0 0 4
## 3: EGF 0 0 8
## 4: EGF 0 0 12
## 5: EGF 0 0 13
## 6: EGF 0 0 15
## Intensity_MeanIntensity_Ratio Channel Stim_Conc
## 1: 1229.246 1 250 ng/ml
## 2: 1279.513 1 250 ng/ml
## 3: 1236.920 1 250 ng/ml
## 4: 1170.460 1 250 ng/ml
## 5: 1150.294 1 250 ng/ml
## 6: 1128.049 1 250 ng/ml
Before proceeding further, execute the cell below. It contains definitions of column names that we will use throughout this workshop.
# create a list with strings for column names
lCol = list(
fov = "Metadata_Series",
frame = "Metadata_T",
trackid = "TrackObjects_Label",
meas = "Intensity_MeanIntensity_Ratio",
measNorm = "Intensity_MeanIntensity_Ratio_Norm",
ch = "Channel",
treatGF = "Stim_Treat",
treatConc = "Stim_Conc",
trackiduni = "TrackObjects_Label_Uni", # we will add this column later
clid = "Cluster_Label" # we will add this column later
)
Create a column with unique time series IDs. It will come in handy later on. Use the paste, paste0, or sprintf functions to concatenate strings from different columns.
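One possible fill-in uses sprintf with zero-padding, which reproduces the `EGF_12_0002`-style IDs shown later in this notebook. The sketch below runs on a tiny synthetic table; in the exercise you would apply the same expression to dtAll, assigning to (lCol$trackiduni):

```r
library(data.table)

# toy table with the same columns as dtAll
dt = data.table(Stim_Treat = c("EGF", "EGF"),
                Metadata_Series = c(12L, 3L),
                TrackObjects_Label = c(2L, 15L))

# zero-padded unique track ID, e.g. "EGF_12_0002"
dt[, TrackObjects_Label_Uni := sprintf("%s_%02d_%04d",
                                       Stim_Treat, Metadata_Series,
                                       TrackObjects_Label)]
```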
dtAll[, (lCol$trackiduni) := ...]
head(dtAll)
summary(dtAll)
## Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
## Length:120086 Min. : 0.000 Min. : 0.00 Min. : 1.00
## Class :character 1st Qu.: 3.000 1st Qu.: 25.00 1st Qu.:11.00
## Mode :character Median : 7.000 Median : 50.00 Median :22.00
## Mean : 7.477 Mean : 49.99 Mean :23.42
## 3rd Qu.:12.000 3rd Qu.: 75.00 3rd Qu.:34.00
## Max. :15.000 Max. :100.00 Max. :80.00
##
## Intensity_MeanIntensity_Ratio Channel Stim_Conc
## Min. : 677.9 Min. :1.000 Length:120086
## 1st Qu.:1196.4 1st Qu.:1.000 Class :character
## Median :1246.3 Median :2.000 Mode :character
## Mean :1257.3 Mean :2.485
## 3rd Qu.:1311.9 3rd Qu.:4.000
## Max. :4060.9 Max. :4.000
## NA's :18
head(dtAll[is.na(get(lCol$meas))])
## Stim_Treat Metadata_Series Metadata_T TrackObjects_Label
## 1: FGF 8 11 1
## 2: FGF 8 11 2
## 3: FGF 8 11 5
## 4: FGF 8 11 6
## 5: FGF 8 11 7
## 6: FGF 8 11 8
## Intensity_MeanIntensity_Ratio Channel Stim_Conc
## 1: NA 3 2.5 ng/ml
## 2: NA 3 2.5 ng/ml
## 3: NA 3 2.5 ng/ml
## 4: NA 3 2.5 ng/ml
## 5: NA 3 2.5 ng/ml
## 6: NA 3 2.5 ng/ml
Check measurements of a single time series
dtAll[get(lCol$trackiduni) == "FGF_08_0001"][[lCol$meas]]
Check the length of all other time series with NAs.
# make a vector with strings of unique track ids of time series that contain NA's
vsTracksWithNAs = dtAll[is.na(get(lCol$meas))][[lCol$trackiduni]]
vsTracksWithNAs
# Key the table according to a unique track ID
setkeyv(dtAll, lCol$trackiduni)
# calculate the number of time points of tracks with NA's
head(
dtAll[vsTracksWithNAs, .N, by = c(lCol$trackiduni)], n = 12L)
Calculate the number of time points per time series. Check the min and max. The .N symbol returns the number of rows per group, e.g.:
summary(
dtAll[, ..., ...][["N"]]
)
Here, we chained two operations on a data.table. First, we calculated the number of rows per time series. This results in a new data.table. We then selected a single column named N from that table by using [[...]] syntax.
Use the summary function to calculate the summary statistics (min, quartiles, mean, max).
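One way to fill in the placeholders, shown here on a small synthetic table; in the exercise, group dtAll by c(lCol$trackiduni):

```r
library(data.table)

dt = data.table(id = rep(c("a", "b"), c(3, 5)), val = 0)

# .N counts rows per group; summary() then reports min, quartiles, mean, max
summary(dt[, .N, by = id][["N"]])
```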
Dataset with interpolated NA's from Milestone 2:
require(R.utils)
require(data.table)
dtAll = fread("./data/m2_allGF_wExpDescr_noNAs.csv.gz")
head(dtAll)
## Stim_Treat Metadata_T Intensity_MeanIntensity_Ratio Stim_Conc
## 1: EGF 0 1183.396 0.25 ng/ml
## 2: EGF 1 1187.431 0.25 ng/ml
## 3: EGF 2 1172.491 0.25 ng/ml
## 4: EGF 3 1176.407 0.25 ng/ml
## 5: EGF 4 1172.421 0.25 ng/ml
## 6: EGF 5 1167.917 0.25 ng/ml
## TrackObjects_Label_Uni
## 1: EGF_12_0002
## 2: EGF_12_0002
## 3: EGF_12_0002
## 4: EGF_12_0002
## 5: EGF_12_0002
## 6: EGF_12_0002
require(ggplot2)
p0 = ggplot(dtAll,
aes_string(x = lCol$frame, y = lCol$meas,
group = lCol$trackiduni)) +
geom_path(alpha = 0.1) # parameter alpha adds transparency
p0
Remove single time points with measurements above the threshold of 2000 and impute them with interpolated data: replace the outlier measurements with NAs and interpolate them with imputeTS::na_interpolation.
# replace time points with measurements above the threshold with NA's
dtAll[... , ...]
# interpolate NA's
require(imputeTS)
dtAll[,
(lCol$meas) := ...,
by = ...]
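A possible fill-in, sketched on a synthetic series (the column names are illustrative; in the exercise, subset on get(lCol$meas), assign to (lCol$meas), and group by the unique track ID):

```r
library(data.table)
library(imputeTS)

dt = data.table(id = "a", meas = c(1200, 1250, 2500, 1260, 1240))

# replace measurements above the threshold with NAs
dt[meas > 2000, meas := NA]

# linearly interpolate the NAs within each time series
dt[, meas := imputeTS::na_interpolation(meas), by = id]
```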
Remove entire trajectories if any measurement is below 1000. Create a vector with unique track IDs for time series that have measurements below the threshold. Subset dtAll using that vector and the %in% operator.
# vector with unique track id's of time series where measurements were below a threshold of 1000
vsTrackOut = unique(dtAll[...][[lCol$trackiduni]])
# Key the table according to a unique track ID column; allows for quick and easy subsetting
setkeyv(dtAll, lCol$trackiduni)
# leave only tracks that are NOT in the outlier vector
dtAll = dtAll[... %in% ...]
# clean
rm(vsTrackOut)
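The same logic on a tiny synthetic table (in the exercise, the subsetting condition uses get(lCol$meas) and the ID column is lCol$trackiduni):

```r
library(data.table)

dt = data.table(id = rep(c("a", "b"), each = 3),
                meas = c(1200, 950, 1250, 1300, 1280, 1310))

# unique IDs of series that ever drop below the threshold
vsOut = unique(dt[meas < 1000][["id"]])

# keep only the series NOT in the outlier vector
dt = dt[!(id %in% vsOut)]
```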
Dataset without outliers from Milestone 3:
dtAll = fread("./data/m3_allGF_wExpDescr_noNAs_noOut.csv.gz")
# same as above; repeated for convenience
p0 = ggplot(dtAll, aes_string(x = lCol$frame,
y = lCol$meas,
group = lCol$trackiduni)) +
geom_path(alpha = 0.1) + # parameter alpha adds transparency
facet_grid(reformulate(lCol$treatGF, lCol$treatConc))
p0
Add a column to dtAll with a normalised measurement where every time series is divided by the mean of its first 20 time points. Plot normalised data.
A column with the mean of first 20 elements of a group can be added this way:
dt[, newCol := mean(.SD[1:20, meas]), by = uniqueID]
.SD corresponds to the subset of data defined by the by argument. It's a temporary data.table and can be used as such.
# add a column with the mean of the baseline for every time series
dtAll[,
baseline := ..., # add a new column with the mean of first 20 rows of the group
by = ...] # group by unique trajectory
# add a column with normalized measurement
dtAll[,
(lCol$measNorm) := ...]
# remove baseline column
dtAll[, baseline := NULL]
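A possible fill-in, applied here to a synthetic table so the snippet runs on its own (in the exercise, the value column is get(lCol$meas), the result column is (lCol$measNorm), and the grouping is the unique track ID):

```r
library(data.table)

dt = data.table(id = rep(c("a", "b"), each = 25),
                meas = c(seq(1000, 1024), seq(2000, 2024)))

# baseline = mean of the first 20 points of each series
dt[, baseline := mean(.SD[1:20, meas]), by = id]

# normalised measurement: divide by the series' own baseline
dt[, measNorm := meas / baseline]
dt[, baseline := NULL]
```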
Dataset with a column with normalised measurement.
dtAll = fread("./data/m4_allGF_wExpDescr_noNAs_noOut_norm0-20.csv.gz")
Plot per condition using normalized data:
# same as above; repeated for convenience
p0 = ggplot(dtAll, aes_string(x = lCol$frame,
y = lCol$measNorm,
group = lCol$trackiduni)) +
geom_path(alpha = 0.1) + # parameter alpha adds transparency
facet_grid(reformulate(lCol$treatGF, lCol$treatConc))
p0
p1 = p0 +
stat_summary(
aes_string(y = lCol$measNorm, group = 1),
fun.y = mean, geom = "line", group = 1,
colour = 'red', linetype = 'solid', size = 1)
p1
Add themes, e.g. + theme_bw(), or themes available in packages such as ggthemes.
Use + labs() to add the title, the subtitle, and the caption.
Use + xlab() or + ylab() to control labels of x and y axes.
require(ggthemes)
p1 +
theme_minimal() +
labs(title = "ERK activity in response to sustained GF treatments",
caption = paste0("Created on ", Sys.Date())) +
xlab("Time (min)") +
ylab("ERK activity")
Make an interactive plot. Use the plotly::ggplotly function.
require(plotly)
...
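A minimal sketch, assuming the ggplot object p0 from above exists; ggplotly converts it into an interactive HTML widget:

```r
require(plotly)
# convert the static ggplot into an interactive widget (zoom, pan, hover)
ggplotly(p0)
```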
Plot ERK activity at selected time points: baseline, peak, relaxation period. Visualise as box-, dot-, violin plots, or their combination.
Use the %in% operator to select rows, e.g.:
dt[frame %in% c(10, 20, 30)]
Use ggplot2 functions such as:
geom_boxplot()
geom_violin()
geom_dotplot(binaxis = "y",
stackdir = "center",
position = "dodge",
binwidth = .01,
binpositions = "all")
For simplicity's sake, we will use aes instead of aes_string. Use explicit column names: `Metadata_T`, `Intensity_MeanIntensity_Ratio_Norm`.
ggplot(dtAll[...],
       aes(x = ...,
           y = ...)) +
... +
facet_grid(reformulate(lCol$treatGF, lCol$treatConc))
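One possible fill-in, sketched on synthetic data (the selected time points 0, 10, 40 and the geom combination are illustrative):

```r
library(data.table)
library(ggplot2)

set.seed(2)
dt = data.table(Metadata_T = rep(c(0, 10, 40), each = 50),
                Intensity_MeanIntensity_Ratio_Norm = rnorm(150, 1, 0.05))

# a violin with a narrow boxplot overlaid, at three selected time points
p = ggplot(dt[Metadata_T %in% c(0, 10, 40)],
           aes(x = as.factor(Metadata_T),
               y = Intensity_MeanIntensity_Ratio_Norm)) +
  geom_violin() +
  geom_boxplot(width = 0.1)
```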
Convert dtAll to a matrix in wide format. Use the dcast function.
# long to wide format
# every row corresponds to a single time series; column are time points
dtAllWide = dcast(dtAll,
... ~ ..., # place for the formula; LHS - row variable, RHS - column variable
value.var = '...') # place for the column name that will be cast into columns
# obtain row names for later
vsRowNames = dtAllWide[[lCol$trackiduni]]
# convert to a matrix; omit the first column
mAllWide = as.matrix(dtAllWide[, -1])
# assign row names to the matrix (useful for later plotting heatmaps from clustering)
rownames(mAllWide) = vsRowNames
# clean
rm(vsRowNames, dtAllWide)
# check the result
head(mAllWide)
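The dcast formula puts the row variable on the left-hand side and the column variable on the right. On a tiny synthetic table (in the exercise it would be TrackObjects_Label_Uni ~ Metadata_T with value.var = lCol$measNorm):

```r
library(data.table)

dt = data.table(id = rep(c("a", "b"), each = 3),
                frame = rep(0:2, 2),
                meas = 1:6)

# long to wide: rows = time series, columns = time points
dtWide = dcast(dt, id ~ frame, value.var = "meas")
```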
require(gplots)
heatmap.2(mAllWide)
require(RColorBrewer)
heatmap.2(mAllWide,
dendrogram = "row", Colv = F,
trace = "none", density.info = "density",
col = rev(colorRampPalette(brewer.pal(11, "RdYlBu"))(99)),
xlab = "Time (min)", ylab = "Cells", key.xlab = "ERK activity")
Play with the parameters of the dist and hclust functions. The former calculates the distance matrix between individual time series. Admissible methods are: euclidean, maximum, manhattan, canberra, binary, or minkowski. The latter performs hierarchical clustering based on the distance matrix and builds a dendrogram. Admissible linkage methods in hclust are: ward.D, ward.D2, single, complete, average, mcquitty, median, or centroid.
require(proxy)
require(dendextend)
require(magrittr)
# create a coloured dendrogram using a given distance and a linkage method
myRowDend <- mAllWide %>%
proxy::dist(., "euclidean") %>% # calculate distance
stats::hclust(., "complete") %>% # create a dendrogram
stats::as.dendrogram(.) %>%
dendextend::set("branches_k_color", k = 4) %>% # color k main branches
dendextend::set("branches_lwd", 2) %>% # set line width of the dendrogram
dendextend::ladderize(.) # reorganize the dendrogram
heatmap.2(mAllWide,
dendrogram = "row", Colv = F,
Rowv = myRowDend, # use our coloured dendrogram to order rows
trace = "none", density.info = "density",
col = rev(colorRampPalette(brewer.pal(11, "RdYlBu"))(99)),
xlab = "Time (min)", ylab = "Cells", key.xlab = "ERK activity")
require(d3heatmap)
d3heatmap(mAllWide,
dendrogram = "row", Colv = F,
Rowv = myRowDend,
trace = "none", density.info = "density",
show_grid = F,
col = rev(colorRampPalette(brewer.pal(11, "RdYlBu"))(99)),
xlab = "Time (min)", ylab = "Cells", key.xlab = "ERK activity")
We will use the dtwclust and proxy packages to calculate the distance matrix. The window.size parameter adjusts the extent of warping; the smaller it is, the faster the algorithm.
# From: https://stackoverflow.com/a/50776685/1898713
require(dtwclust)
require(proxy)
require(dendextend)
require(magrittr)
myRowDend <- mAllWide %>%
proxy::dist(., method = "dtw_basic", pattern = symmetric1, norm = "L1", window.size = 10L) %>%
stats::as.dist(.) %>% # conversion required because dtw_basic returns a cross-distance matrix; it is symmetric, thus we take the lower triangle
stats::hclust(., "complete") %>%
stats::as.dendrogram(.) %>%
dendextend::set("branches_k_color", k = 5) %>%
dendextend::set("branches_lwd", 2) %>%
dendextend::ladderize(.)
heatmap.2(mAllWide,
dendrogram = "row", Colv = F,
Rowv = myRowDend, # use our coloured dendrogram to order rows
trace = "none", density.info = "density",
col = rev(colorRampPalette(brewer.pal(11, "RdYlBu"))(99)),
xlab = "Time (min)", ylab = "Cells", key.xlab = "ERK activity")
After performing hierarchical clustering and after cutting the dendrogram at a desired level, we extract assignments of cluster numbers to individual time series.
Again, you can play with different distance and linkage methods in dist and hclust functions.
clTree <- mAllWide %>%
stats::dist(., method = "euclidean") %>%
stats::hclust(., "complete")
str(clTree)
Cut the dendrogram at a desired level. Here, we cut it at 6. Play with that value.
clAssign = dendextend::cutree(clTree, k = 6)
head(clAssign)
Convert named vector to a data table.
dtClAssign = as.data.table(clAssign, keep.rownames = T)
setnames(dtClAssign, c(lCol$trackiduni, lCol$clid))
head(dtClAssign)
Merge original time series with cluster assignments for individual time series.
dtAllCl = merge(dtAll, dtClAssign, by = lCol$trackiduni)
# convert cluster label to a factor
dtAllCl[, (lCol$clid) := as.factor(get(lCol$clid))]
head(dtAllCl)
If you are completely lost, here's a file with the dataset resulting from the previous steps:
dtAllCl = fread("./data/m5_allGF_wExpDescr_noNAs_noOut_norm0-20_cl.csv.gz")
# convert cluster label to a factor
dtAllCl[, (lCol$clid) := as.factor(get(lCol$clid))]
Plot the time series in each cluster and include the population mean. Use faceting per cluster.
ggplot(dtAllCl, aes_string(x = lCol$frame, y = lCol$measNorm,
group = lCol$trackiduni)) +
geom_path(alpha = 0.1) +
stat_summary(
aes_string(y = lCol$measNorm, group = 1),
fun.y = ..., # place for the function to calculate the summary
geom = "line", group = 1,
colour = 'red', linetype = 'solid', size = 1) +
facet_wrap(...) + # place for facetting variable
theme_minimal() +
labs(title = "Clusters of ERK activity dynamic responses",
caption = paste0("Created on ", Sys.Date())) +
xlab("Time (min)") +
ylab("ERK activity (normalised)")
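A possible fill-in for the two placeholders, sketched on synthetic data: the summary function is mean and the faceting variable is the cluster-label column (in the exercise, facet_wrap(lCol$clid)). Note that fun.y was renamed to fun in ggplot2 >= 3.3:

```r
library(data.table)
library(ggplot2)

dt = data.table(frame = rep(0:4, 4),
                measNorm = rep(c(1, 1.2, 1.1, 1.05, 1), 4),
                id = rep(c("a", "b", "c", "d"), each = 5),
                Cluster_Label = factor(rep(1:2, each = 10)))

p = ggplot(dt, aes(x = frame, y = measNorm, group = id)) +
  geom_path(alpha = 0.3) +
  stat_summary(aes(group = 1),
               fun = mean,              # population mean per frame
               geom = "line", group = 1, colour = "red") +
  facet_wrap(~ Cluster_Label)           # one panel per cluster
```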
Calculate and plot the composition of experimental conditions with respect to clusters. Perform data aggregation to calculate the number of time series per group, per cluster. The shortcut to calculate the number of rows in data.table is .N, e.g.
dt[, .(nTimeSer = .N), by = group]
Aggregate and assign the result to a new data table dtAllClN:
dtAllClN = dtAllCl[,
...,
by = c(...)]
head(dtAllClN)
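One possible fill-in, shown on a synthetic table. In the exercise the grouping would be by = c(lCol$treatGF, lCol$treatConc, lCol$clid); note that if dtAllCl holds many rows per time series, you would count unique track IDs rather than raw rows:

```r
library(data.table)

dt = data.table(Stim_Treat = rep(c("EGF", "FGF"), each = 4),
                Stim_Conc = rep(c("low", "high"), 4),
                Cluster_Label = factor(c(1, 1, 2, 2, 1, 2, 2, 2)))

# number of rows per condition and per cluster
dtN = dt[, .(nTimeSer = .N),
         by = c("Stim_Treat", "Stim_Conc", "Cluster_Label")]
```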
Make a bar-plot:
# for percentages on y-axis in ggplot
require(scales)
# The core plot: concentrations on the x-axis, the number of time series on y-axis
p5 = ggplot(dtAllClN, aes_string(x = ..., # places for x and y variables
y = ...))
# Facetting per growth factor
p5 = p5 +
facet_wrap(...)
# Stacked bar plot with bars coloured by cluster number
# Bars are stretched to "fill" the y-axis using the option position = position_fill()
p5 = p5 +
geom_bar(aes_string(...), # place for variable to colour the bars
stat = "identity",
position = position_fill())
# Convert y-axis labels to percentages (0-100%) instead of (0-1)
p5 = p5 +
scale_y_continuous(labels = percent)
# Use a nice colour palette for colour fill
p5 = p5 +
scale_fill_manual("Cluster:", # legend title; the fill is the cluster label
values = ggthemes::tableau_color_pal("Color Blind")(6))
# Prettify the plot; add labels, etc
p5 = p5 +
theme_minimal() +
labs(title = "Participation of clusters in experimental conditions",
caption = paste0("Created on ", Sys.Date())) +
xlab("") +
ylab("Percentage") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p5
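Putting the fill-ins together on a synthetic aggregate table (in the exercise, the x, y, facet, and fill variables come from lCol; the column names below mirror them):

```r
library(data.table)
library(ggplot2)
library(scales)

dtN = data.table(Stim_Conc = rep(c("low", "high"), each = 2),
                 Stim_Treat = "EGF",
                 Cluster_Label = factor(rep(1:2, 2)),
                 nTimeSer = c(5, 15, 12, 8))

p = ggplot(dtN, aes(x = Stim_Conc, y = nTimeSer)) +
  facet_wrap(~ Stim_Treat) +             # one panel per growth factor
  geom_bar(aes(fill = Cluster_Label),    # colour the bars by cluster
           stat = "identity",
           position = position_fill()) + # stretch stacks to 100%
  scale_y_continuous(labels = percent)
```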
Based on Clustering Validation Statistics.
To work on this section, you need to load data from Milestone 4 and to convert it to wide format.
require(factoextra)
# Silhouette method
factoextra::fviz_nbclust(mAllWide, hcut, method = "silhouette") +
labs(subtitle = "Silhouette method")
# Gap statistic
# nboot = 10 to keep the function speedy.
# recommended value: nboot= 500 for your analysis.
# Use verbose = FALSE to hide computing progression.
set.seed(123)
factoextra::fviz_nbclust(mAllWide, hcut, method = "gap_stat",
nstart = 25, nboot = 10)+
labs(subtitle = "Gap statistic method")
We'll use the package NbClust which will compute, with a single function call, 30 indices for deciding the right number of clusters in the dataset:
require(NbClust)
nb <- NbClust(mAllWide,
distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "complete",
index ="all")
## Warning in log(det(P)/det(W)): NaNs produced
## Warning in log(det(P)/det(W)): NaNs produced
## Warning in log(det(P)/det(W)): NaNs produced
## Warning in log(det(P)/det(W)): NaNs produced
## Warning in log(det(P)/det(W)): NaNs produced
## Warning in log(det(P)/det(W)): NaNs produced
## Warning in log(det(P)/det(W)): NaNs produced
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 5 proposed 2 as the best number of clusters
## * 9 proposed 3 as the best number of clusters
## * 3 proposed 4 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
Visualise results from NbClust:
# Visualize the result
factoextra::fviz_nbclust(nb) + theme_minimal()
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 5 proposed 2 as the best number of clusters
## * 9 proposed 3 as the best number of clusters
## * 3 proposed 4 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
## * 3 proposed NA's as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 3 .
hc.res <- factoextra::eclust(mAllWide,
"hclust", k = 3, graph = FALSE,
hc_method = "complete",
hc_metric = "euclidean")
# Visualize clusters
factoextra::fviz_cluster(hc.res, geom = "point", frame.type = "norm")
## Warning: argument frame is deprecated; please use ellipse instead.
## Warning: argument frame.type is deprecated; please use ellipse.type
## instead.
factoextra::fviz_silhouette(hc.res)
## cluster size ave.sil.width
## 1 1 893 0.21
## 2 2 274 0.48
## 3 3 20 0.48
A free R/Shiny app is available here.